Demonstrating Mixture Models

Liberally adapted from Antonino Ingargiola's GitHub page.

This notebook is originally from Dr. Chandola's version of 474/574.

Need for mixture models

We first demonstrate the need for a mixture model over a single model using a simple one-dimensional example. Fitting a single Gaussian works well if the observed data actually follows that assumption; however, if the data has two (or more) modes, a single Gaussian does not work.
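As a quick illustration, here is a minimal sketch (not the notebook's original code) that draws a bimodal sample matching the parameter values used later in this notebook and fits a single Gaussian by MLE; the fitted density peaks between the two modes:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Bimodal sample built from the two components used throughout this notebook.
s = np.concatenate([rng.normal(0.0, 0.08, 300),
                    rng.normal(0.6, 0.12, 700)])

# The single-Gaussian MLE is just the sample mean and standard deviation.
mu_hat, sig_hat = s.mean(), s.std()

# Average log-likelihood under the single-Gaussian fit; the fitted density peaks
# between the two modes, which is the failure this section illustrates.
print(mu_hat, sig_hat, norm.logpdf(s, mu_hat, sig_hat).mean())
```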

Sampling from a mixture model
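A minimal sampling sketch, assuming the "true" parameter values quoted later in this notebook ($\pi_1 = 0.3$, $\mu_1 = 0.0$, $\sigma_1 = 0.08$, $\mu_2 = 0.6$, $\sigma_2 = 0.12$); the variable names are illustrative, not the notebook's originals:

```python
import numpy as np

rng = np.random.default_rng(1)

# "True" mixture parameters used throughout this notebook.
pi1, mu1, sig1 = 0.3, 0.0, 0.08
mu2, sig2 = 0.6, 0.12
N = 1000

# Draw a component label for every point, then sample from that component.
from_first = rng.random(N) < pi1
s = np.where(from_first,
             rng.normal(mu1, sig1, N),
             rng.normal(mu2, sig2, N))
```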

Learning parameters for a distribution using MLE

The model for this sample is a linear combination of two Gaussian PDFs:

$$p(x|\theta) = \frac{\pi_1}{\sigma_1\sqrt{2\pi}} {\rm exp} \left\{ -\frac{(x-\mu_1)^2}{2\sigma_1^2} \right\} + \frac{\pi_2}{\sigma_2\sqrt{2\pi}} {\rm exp} \left\{ -\frac{(x-\mu_2)^2}{2\sigma_2^2} \right\}$$

where $\theta = [\mu_1, \sigma_1, \mu_2, \sigma_2, \pi_1]$. Note that $\pi_2$ is not included in $\theta$ since $\pi_2 = 1-\pi_1$.

In Python we can define $p(x|\theta)$ using a Gaussian PDF function such as scipy.stats.norm.pdf:
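A possible definition (a sketch, not necessarily the original cell); the name `pdf_model` is illustrative and the parameter ordering $[\mu_1, \sigma_1, \mu_2, \sigma_2, \pi_1]$ follows the convention above:

```python
from scipy.stats import norm

def pdf_model(x, theta):
    """Two-Gaussian mixture PDF; theta = [mu1, sig1, mu2, sig2, pi1]."""
    mu1, sig1, mu2, sig2, pi1 = theta
    return pi1 * norm.pdf(x, mu1, sig1) + (1 - pi1) * norm.pdf(x, mu2, sig2)
```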

Maximum Likelihood: direct maximization

We first see how to estimate the mixture parameters by direct maximization of the log-likelihood.

Given a sample $X = \{x_i\}$ of size $N$ drawn from the mixture distribution, the likelihood function is

$$\mathcal{L}(\theta, X) = \prod_i p(x_i|\theta)$$

and the log-likelihood function is:

$$\ln \mathcal{L}(\theta, X) = \sum_i \ln p(x_i|\theta)$$

Now, since $p(\cdot)$ is the sum of two terms, the term $\ln p(x_i|\theta)$ cannot be simplified (it is the log of a sum). So for each $x_i$ we must compute the log of a sum of two exponentials. Not only is this computation slow, but numerical errors are also amplified. Moreover, the likelihood function often has local maxima other than the global one (in other words, the negative log-likelihood is not convex).

In Python the log-likelihood function can be defined as:
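One way to write it, reusing the hypothetical `pdf_model` from the sketch above (the argument order puts $\theta$ first so the function can be handed directly to SciPy's optimizers):

```python
import numpy as np

def log_likelihood(theta, x):
    """Log-likelihood of the sample x under the two-Gaussian mixture."""
    return np.sum(np.log(pdf_model(x, theta)))

# Optimizers minimize, so we work with the negative log-likelihood.
def neg_log_likelihood(theta, x):
    return -log_likelihood(theta, x)
```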

We now try to minimize the negative log-likelihood using several methods available in scipy.optimize.
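For example, with scipy.optimize.minimize, assuming the `neg_log_likelihood` function and the sample `s` from the sketches above; the initial guess `theta0` is arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

# Initial guess for [mu1, sig1, mu2, sig2, pi1]; the result depends strongly on it.
theta0 = np.array([0.2, 0.2, 0.8, 0.2, 0.5])

# A derivative-free method and a bounded quasi-Newton method for comparison;
# the bounds keep the sigmas positive and pi1 inside (0, 1).
res_nm = minimize(neg_log_likelihood, theta0, args=(s,), method='Nelder-Mead')
res_lb = minimize(neg_log_likelihood, theta0, args=(s,), method='L-BFGS-B',
                  bounds=[(None, None), (1e-6, None),
                          (None, None), (1e-6, None), (1e-3, 1 - 1e-3)])
print(res_nm.x)
print(res_lb.x)
```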

Why is the mixture model log-likelihood difficult to optimize?

To answer this, let us take the above data set with two modes and assume that we know the true values of $\pi$, $\sigma_1$ and $\sigma_2$. We then compute the negative log-likelihood (the quantity that we want to minimize) for various values of $\mu_1$ and $\mu_2$. Recall that the true parameter values are $\pi = 0.3$, $\mu_1 = 0.0$, $\sigma_1 = 0.08$, $\mu_2 = 0.6$ and $\sigma_2 = 0.12$.
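A sketch of that computation, assuming `pdf_model` and the sample `s` from the earlier sketches; the grid limits are arbitrary:

```python
import numpy as np

# Hold pi1, sigma1 and sigma2 at their true values and scan mu1, mu2 over a grid.
pi1_true, sig1_true, sig2_true = 0.3, 0.08, 0.12
mu_grid = np.linspace(-0.5, 1.0, 100)

nll = np.empty((mu_grid.size, mu_grid.size))
for i, m1 in enumerate(mu_grid):
    for j, m2 in enumerate(mu_grid):
        theta = [m1, sig1_true, m2, sig2_true, pi1_true]
        nll[i, j] = -np.sum(np.log(pdf_model(s, theta)))

# A contour plot of `nll` shows two minima (the components can be swapped)
# separated by ridges and flat regions, which is why a generic optimizer
# can easily get stuck.
```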

Expectation Maximization

Now we will estimate the parameters using the EM procedure for Gaussian mixtures.

Starting from the PDFs $p_1(\cdot)$ and $p_2(\cdot)$ of the individual components:

$$p_1(x|\theta) = \frac{1}{\sigma_1\sqrt{2\pi}} {\rm exp} \left\{ -\frac{(x-\mu_1)^2}{2\sigma_1^2} \right\} \qquad p_2(x|\theta) = \frac{1}{\sigma_2\sqrt{2\pi}} {\rm exp} \left\{ -\frac{(x-\mu_2)^2}{2\sigma_2^2} \right\} $$

the mixture PDF is:

$$ p(x|\theta) = \pi_1 p_1(x|\theta) + \pi_2 p_2(x|\theta) $$

If we know (or guess initially) the parameters $\theta$, we can compute for each sample and each component the responsibility function defined as:

$$\gamma(i, k) = \frac{\pi_k\, p_k(x_i|\theta)}{p(x_i|\theta)}$$

and, starting from the "effective" number of samples for each component ($N_k$), we can compute the "new" parameter estimates:

$$N_k = \sum_{i=1}^N \gamma(i, k) \qquad k=1,2 \qquad ({\rm note\;that} \quad N_1 + N_2 = N)$$

$$\mu_k^{new} = \frac{1}{N_k}\sum_{i=1}^N \gamma(i, k) \, x_i$$

$$\sigma_k^{2\,new} = \frac{1}{N_k}\sum_{i=1}^N \gamma(i, k) \, (x_i - \mu_k^{new})^2$$

$$\pi_k^{new} = \frac{N_k}{N}$$
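Put together, one EM update could look like the following sketch (the function name `em_step` is illustrative, not part of the original notebook):

```python
import numpy as np
from scipy.stats import norm

def em_step(x, mu1, sig1, mu2, sig2, pi1):
    """One EM iteration for a two-component 1-D Gaussian mixture."""
    # E-step: responsibilities gamma(i, k) of each component for each sample.
    w1 = pi1 * norm.pdf(x, mu1, sig1)
    w2 = (1 - pi1) * norm.pdf(x, mu2, sig2)
    gamma1 = w1 / (w1 + w2)
    gamma2 = 1.0 - gamma1

    # M-step: effective counts N_k and updated parameter estimates.
    N1, N2 = gamma1.sum(), gamma2.sum()
    mu1_new = np.sum(gamma1 * x) / N1
    mu2_new = np.sum(gamma2 * x) / N2
    sig1_new = np.sqrt(np.sum(gamma1 * (x - mu1_new) ** 2) / N1)
    sig2_new = np.sqrt(np.sum(gamma2 * (x - mu2_new) ** 2) / N2)
    pi1_new = N1 / x.size
    return mu1_new, sig1_new, mu2_new, sig2_new, pi1_new
```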

Now we just loop:

  1. recompute $\gamma(i, k)$
  2. estimate updated parameters

until convergence.
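A minimal convergence loop, using the `em_step` sketch above and the sample `s`; the initial guess and the stopping tolerance are arbitrary choices:

```python
import numpy as np

# Initial guesses for mu1, sig1, mu2, sig2, pi1.
params = (0.2, 0.2, 0.8, 0.2, 0.5)

for it in range(200):
    new_params = em_step(s, *params)
    # Stop once the parameters stop changing appreciably.
    if np.allclose(new_params, params, atol=1e-6):
        params = new_params
        break
    params = new_params

print(it, params)
```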

Animated view of 2D mixture learning